
[SPARK-45484][SQL] Fix the bug that uses incorrect parquet compression codec lz4raw #43310

Closed
wants to merge 2 commits

Conversation

beliefer
Contributor

@beliefer beliefer commented Oct 10, 2023

What changes were proposed in this pull request?

#41507 supported the new parquet compression codec `lz4raw`. However, `lz4raw` is not a valid parquet compression codec name.

This mistake causes errors. Please refer to https://github.com/apache/spark/pull/43310/files#r1352405312

The root cause is that parquet uses `lz4_raw` as the codec name and stores it in the metadata of the parquet file.

We should use lz4_raw as its name.
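The naming mismatch described above can be sketched as follows. This is a minimal, hypothetical illustration (the names `PARQUET_CODEC_NAMES` and `to_parquet_codec` are not Spark's actual code): parquet's `CompressionCodecName` enum spells the codec `LZ4_RAW`, so the lowercased option Spark accepts should be `lz4_raw`, not `lz4raw`.

```python
# Hypothetical sketch of the naming mismatch; not Spark's actual implementation.
# Parquet's CompressionCodecName enum spells the codec "LZ4_RAW", so the
# normalized option name should be "lz4_raw", never "lz4raw".
PARQUET_CODEC_NAMES = {
    "none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4_raw", "zstd",
}

def to_parquet_codec(option: str) -> str:
    """Normalize a user-supplied codec option to parquet's enum spelling."""
    name = option.lower()
    if name == "none":
        # "none" is an alias for no compression.
        name = "uncompressed"
    if name not in PARQUET_CODEC_NAMES:
        raise ValueError(f"Unknown parquet compression codec: {option}")
    return name.upper()

print(to_parquet_codec("LZ4_RAW"))  # LZ4_RAW
```

Under this sketch, the old spelling `lz4raw` is rejected as unknown, which mirrors the bug: the name Spark accepted never matched what parquet writes into the file metadata.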

Why are the changes needed?

Fix the bug that uses incorrect parquet compression codec lz4raw.

Does this PR introduce any user-facing change?

'Yes'.
Fix a bug.

How was this patch tested?

New test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

@github-actions github-actions bot added the SQL label Oct 10, 2023
@@ -105,7 +114,7 @@ class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSpar

test("Create parquet table with compression") {
Seq(true, false).foreach { isPartitioned =>
-    val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4")
+    val codecs = Seq("UNCOMPRESSED", "SNAPPY", "GZIP", "ZSTD", "LZ4", "LZ4_RAW")
Contributor Author

Before this fix, if we used LZ4RAW here, the test case would fail:

realCompressionCodecs.forall(((x$1: String) => x$1.==(compressionCodec))) was false
ScalaTestFailureLocation: org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite at (ParquetCompressionCodecPrecedenceSuite.scala:101)
org.scalatest.exceptions.TestFailedException: realCompressionCodecs.forall(((x$1: String) => x$1.==(compressionCodec))) was false
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$checkCompressionCodec$2(ParquetCompressionCodecPrecedenceSuite.scala:101)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable(SQLTestUtils.scala:306)
	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable$(SQLTestUtils.scala:304)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.withTable(ParquetCompressionCodecPrecedenceSuite.scala:30)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$checkCompressionCodec$1(ParquetCompressionCodecPrecedenceSuite.scala:96)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$checkCompressionCodec$1$adapted(ParquetCompressionCodecPrecedenceSuite.scala:94)
	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:79)
	at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:78)
	at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:245)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(ParquetCompressionCodecPrecedenceSuite.scala:30)
	at org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:78)
	at org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:77)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.withTempDir(ParquetCompressionCodecPrecedenceSuite.scala:30)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.checkCompressionCodec(ParquetCompressionCodecPrecedenceSuite.scala:94)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$10(ParquetCompressionCodecPrecedenceSuite.scala:110)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$10$adapted(ParquetCompressionCodecPrecedenceSuite.scala:109)
	at scala.collection.immutable.List.foreach(List.scala:333)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$9(ParquetCompressionCodecPrecedenceSuite.scala:109)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$9$adapted(ParquetCompressionCodecPrecedenceSuite.scala:107)
	at scala.collection.immutable.List.foreach(List.scala:333)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite.$anonfun$new$8(ParquetCompressionCodecPrecedenceSuite.scala:107)

Member

Is this only a failure of the test utility method `checkCompressionCodec`, or is there a user-facing issue?

Contributor Author

Yes, `checkCompressionCodec` failed. It's a test utility method, so the failure is not user-facing.

@LuciferYang
Contributor

cc @wangyum FYI

@beliefer beliefer requested a review from wangyum October 11, 2023 03:27
@beliefer
Contributor Author

cc @dongjoon-hyun @cloud-fan

@@ -59,7 +59,7 @@ class ParquetCodecSuite extends FileSourceCodecSuite {
// Exclude "brotli" because the com.github.rdblue:brotli-codec dependency is not available
// on Maven Central.
override protected def availableCodecs: Seq[String] = {
-    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4raw")
+    Seq("none", "uncompressed", "snappy", "gzip", "zstd", "lz4", "lz4_raw")
Member

In this case, it currently succeeds. Do you know what makes the difference, @beliefer?

Contributor Author

.version("1.1.1")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(
-        Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4raw", "zstd"))
+        Set("none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "lz4_raw", "zstd"))
Member

Well, unfortunately, we cannot do it like this because Apache Spark 3.5.0 is already released.

Contributor Author

I know that. But since 3.5.0 was released only recently, could we still fix it?

Member

No. If we remove this here, production jobs with the existing configuration will fail with Spark 3.5.1.

@@ -96,7 +96,7 @@ object ParquetOptions extends DataSourceOptions {
"lzo" -> CompressionCodecName.LZO,
"brotli" -> CompressionCodecName.BROTLI,
"lz4" -> CompressionCodecName.LZ4,
"lz4raw" -> CompressionCodecName.LZ4_RAW,
Member

Ditto. We cannot delete it like this; we can only add a new entry, like the next line, for backward compatibility.
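The backward-compatible approach suggested here, keeping the legacy key and adding the correct one next to it, could be sketched like this (the dictionary name is illustrative, not Spark's actual code, which maps to parquet's `CompressionCodecName` enum in Scala):

```python
# Sketch under assumed names: both spellings resolve to the same parquet
# codec, so jobs configured with the legacy "lz4raw" keep working.
SHORT_NAME_TO_PARQUET_CODEC = {
    "none": "UNCOMPRESSED",
    "uncompressed": "UNCOMPRESSED",
    "snappy": "SNAPPY",
    "gzip": "GZIP",
    "lzo": "LZO",
    "brotli": "BROTLI",
    "lz4": "LZ4",
    "lz4raw": "LZ4_RAW",   # legacy spelling, kept for backward compatibility
    "lz4_raw": "LZ4_RAW",  # correct spelling matching parquet's enum
    "zstd": "ZSTD",
}
```

The design point is that removal would be a breaking change in a patch release, while an added alias is purely additive.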

Contributor Author

This is a good idea.

@@ -1014,12 +1014,12 @@ object SQLConf {
"`parquet.compression` is specified in the table-specific options/properties, the " +
"precedence would be `compression`, `parquet.compression`, " +
"`spark.sql.parquet.compression.codec`. Acceptable values include: none, uncompressed, " +
-      "snappy, gzip, lzo, brotli, lz4, lz4raw, zstd.")
+      "snappy, gzip, lzo, brotli, lz4, lz4_raw, zstd.")
Member

What is the suffix of the file after changing to LZ4_RAW?

Contributor Author

Before this PR:
part-00000-fc07d464-03b2-42d6-adc1-68a3adca1752.c000.lz4raw.parquet
After this PR:
part-00000-07244014-f31a-4097-8878-dd3630e721ce.c000.lz4raw.parquet

Member

Since there is no change at the file-name layer, it's good.

@beliefer
Contributor Author

@dongjoon-hyun @wangyum Because Spark 3.5.0 has been released, the name `lz4raw` is already exposed. So we keep `lz4raw` for backward compatibility and add the correct `lz4_raw`. We can deprecate `lz4raw` in a later Spark version and remove it in the future.

@dongjoon-hyun
Member

Yes, I guess we can deprecate it in Spark 3.5.1 with a warning and remove it in Apache Spark 4.0.0. Since it's a fairly new config value, there should not be many users.

@beliefer
Contributor Author

Yes, I guess we can deprecate it in Spark 3.5.1 with a warning and remove it in Apache Spark 4.0.0. Since it's a fairly new config value, there should not be many users.

I will create a PR for branch-3.5 to deprecate lz4raw and add lz4_raw.
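The deprecation path agreed above, accepting the legacy spelling while steering users to the new one, could look roughly like this minimal sketch (the function `resolve_codec` is hypothetical; Spark's actual change lives in its Scala option-parsing code, and the warning text matches the message the follow-up PR adds):

```python
import warnings

# Hypothetical sketch of the deprecation path: the legacy spelling still
# works, but emits a warning pointing users at the correct name.
def resolve_codec(name: str) -> str:
    name = name.lower()
    if name == "lz4raw":
        warnings.warn(
            "Parquet compression codec 'lz4raw' is deprecated, please use 'lz4_raw'",
            DeprecationWarning,
        )
        name = "lz4_raw"
    return name
```

This keeps existing 3.5.0 jobs running on 3.5.1 while giving them a clear migration signal before the alias is removed in 4.0.0.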

beliefer added a commit that referenced this pull request Oct 17, 2023
…codec lz4raw

### What changes were proposed in this pull request?
According to the discussion at #43310 (comment), this PR deprecates the incorrect parquet compression codec `lz4raw` in Spark 3.5.1 and adds a warning log.

The warning log tells users that `lz4raw` will be removed in Apache Spark 4.0.0.

### Why are the changes needed?
Deprecate the incorrect parquet compression codec `lz4raw`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the warning log below.
`Parquet compression codec 'lz4raw' is deprecated, please use 'lz4_raw'`

### How was this patch tested?
Existing test cases and new test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43330 from beliefer/SPARK-45484_3.5.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
@beliefer
Contributor Author

#43330 merged.
@dongjoon-hyun @wangyum @LuciferYang Could you review this PR again?

@beliefer
Contributor Author

@dongjoon-hyun Do you have time to take a look?

@beliefer
Contributor Author

cc @srowen

@github-actions github-actions bot added the DOCS label Oct 19, 2023
@beliefer beliefer closed this in b9ac7d3 Oct 20, 2023
@beliefer
Contributor Author

srowen pushed a commit that referenced this pull request Oct 21, 2023
…pression codec

### What changes were proposed in this pull request?
This PR follows up #43310 to update the documentation of the parquet compression codec.

### Why are the changes needed?
Update the documentation of the parquet compression codec.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43464 from beliefer/SPARK-45484_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Sean Owen <[email protected]>